Global Music

INTRODUCTION

In today’s society, music is a language of communication, and many musicians compose music to convey a specific message, including messages aimed at politicians and public figures.

It’s fascinating to see how the music on your own playlist compares with what everyone else is listening to. The Spotify API data we obtained from Canvas contains a wealth of information that allows us to determine the popularity of a song.

The goal of this project is to analyze and visualize data from the GlobalMusicData data set. The data set includes detailed information about the performers, as well as their tracks, genres, and playlists. Since 1993, the GlobalMusicData data set has contained information on track names, albums, playlists, genres, and much more for various artists.

We used R to analyze and visualize the data, to detect trends in the artists’ recordings, and to uncover insights. The work is organized into the sections below.

More Data

For more information, see the Spotify API documentation.

Packages Required

library(readr)     # reading CSV files
library(plotly)    # interactive, publication-quality graphs
library(tidyr)     # tidying data
library(GGally)    # extends ggplot2 with additional plot functions
library(prettydoc) # themes for R Markdown documents
library(DT)        # displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # date/time functions
library(magrittr)  # piping
library(ggplot2)   # data visualization
library(dplyr)     # data manipulation

Data Preparation

The code used to assess the variables in the raw data is as follows. We found that the data set has 32,833 observations and 23 variables, which are listed below.

# Importing the data
data <- read.csv("Global Music Data.csv", header = TRUE, sep = ",")

Data Cleaning

Data cleaning is the process of correcting or removing incorrect, corrupted, improperly formatted, duplicate, or incomplete data within a dataset. Data can easily be duplicated or mislabeled when multiple data sources are merged.

# Compute summary statistics for the variables
datatable(
  summary(data)
)
# Identify the data type of each variable
# (str() prints its result and returns NULL, so it is called directly
# rather than wrapped in datatable())
str(data)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "14/6/2019" "13/12/2019" "5/7/2019" "19/7/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
# Finding missing data

#number of missing values in this data frame.
sum(is.na(data))
## [1] 15
#Count the number of missing values per column
colSums(is.na(data))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
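The str() output above lists track_album_release_date as a character column in day/month/year form. As a minimal sketch, assuming every value follows that format, the column could be converted to a Date with lubridate's dmy(). The vector below only reuses the four sample values shown in the output, and the names release_chr and release_date are illustrative.

```r
library(lubridate)

# Sample values copied from the str() output above (not the full column)
release_chr <- c("14/6/2019", "13/12/2019", "5/7/2019", "19/7/2019")

# dmy() parses day/month/year strings into Date objects;
# values that do not match the format become NA with a warning
release_date <- dmy(release_chr)

class(release_date)        # confirms the column is now a Date
range(year(release_date))  # earliest and latest release year in the sample
sum(is.na(release_date))   # number of values that failed to parse
```

On the full data set, a non-zero count in the last line would indicate dates stored in some other format, which would need separate handling.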

Removing missing values

To work on a clean dataset, we removed the records that contained missing values.

Incomplete records can be removed by passing the data frame or matrix to the na.omit() function. It is a quick way to drop rows containing NA values in R.

#Remove missing data
#store new cleaned data to data1
data1 <- na.omit(data) 
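The effect of na.omit() can be checked by comparing row counts before and after. The sketch below uses a small toy data frame with made-up values (toy, toy_clean, and the song names are hypothetical, not taken from the real data) so that it runs without the CSV file.

```r
# Toy data frame standing in for the music data (hypothetical values)
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", "Song D"),
  track_artist     = c("Artist 1", NA, "Artist 3", "Artist 4"),
  track_popularity = c(66, 67, 70, 60),
  stringsAsFactors = FALSE
)

# Count rows before and after dropping incomplete records
rows_before <- nrow(toy)
toy_clean   <- na.omit(toy)
rows_after  <- nrow(toy_clean)

rows_before - rows_after  # number of rows removed
sum(is.na(toy_clean))     # missing values left after cleaning
```

Note that na.omit() drops a whole row if any column in it is NA, so the number of rows removed can be smaller than the total count of missing values.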

Return Column Names of the data

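# Return the column names
# (names() is applied to the whole logical vector, so every column name is
# returned, not only the columns that contain missing values)
names(colSums(is.na(data)) > 0)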
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"
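To list only the columns that do or do not contain missing values, the per-column NA counts can be used as a filter on the names. This is a sketch on a toy data frame with hypothetical values. The names toy, na_counts, cols_with_na, and cols_without_na are illustrative.

```r
# Toy data frame standing in for the music data (hypothetical values)
toy <- data.frame(
  track_id         = c("a1", "b2", "c3"),
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist 1", NA, "Artist 3"),
  track_popularity = c(66, 67, 70),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the names whose count meets the condition
cols_with_na    <- names(na_counts)[na_counts > 0]
cols_without_na <- names(na_counts)[na_counts == 0]

cols_with_na
cols_without_na
```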

Read first 10 lines

# Read first 10 rows of the cleaned data set

datatable(head(data1, 10),options = list(scrollX=TRUE, pageLength=5))

Read last 10 lines of the cleaned data

# Read last 10 rows of the cleaned data set

datatable(tail(data1, 10),options = list(scrollX=TRUE, pageLength=5))

Proposed Data Visualization and Exploratory Data Analysis

The scatterplot matrices below show the pairwise relationships between the numeric audio features.

pairs(~danceability+energy+key+loudness,data = data1,
   main = "Scatterplot Matrix For GlobalMusicData")

1. Danceability, energy, key, loudness

From the scatter plots above:

pairs(~acousticness+liveness+tempo+instrumentalness,data = data1,
   main = "Scatterplot Matrix For GlobalMusicData")

2. Acousticness, liveness, tempo, instrumentalness

From the scatter plot above:

pairs(~mode+speechiness+duration_ms,data = data1,
   main = "Scatterplot Matrix For GlobalMusicData")

3. Mode, speechiness, duration_ms

From the scatter plot above:

ggplot(data1, aes(x = playlist_genre, y = track_popularity)) +
  # format y-axis labels with thousands separators
  scale_y_continuous(labels = scales::comma) +
  # customize bars; stat = "identity" stacks the popularity of every track,
  # so each bar's height is the genre total
  geom_bar(color = "black",
           fill = "pink",
           width = 0.5,
           stat = "identity") +
  # add value labels (one label per track)
  geom_text(aes(label = track_popularity),
            vjust = -0.25) +
  # customize x, y axis labels and title
  ggtitle("Track popularity by playlist genre") +
  xlab("Playlist genre") +
  ylab("Popularity of the Track") +
  # change fonts
  theme(plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        axis.title.x = element_text(color = "black", size = 11, face = "bold"),
        axis.title.y = element_text(color = "black", size = 11, face = "bold"))
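Because the plot above draws one bar segment and one text label per track, the bars show genre totals and the labels overlap heavily. One possible alternative, sketched here on toy data with hypothetical values, is to aggregate to one row per genre first and plot the mean popularity. The names toy, genre_popularity, and mean_popularity are illustrative, and the choice of the mean rather than the median or total is an assumption.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for data1 (hypothetical values, not taken from the real data)
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "rap"),
  track_popularity = c(66, 70, 40, 50, 58)
)

# Aggregate to one row per genre so each bar gets exactly one label
genre_popularity <- toy %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), .groups = "drop")

# geom_col() plots the values as given, with no stacking of individual tracks
p <- ggplot(genre_popularity, aes(x = playlist_genre, y = mean_popularity)) +
  geom_col(color = "black", fill = "pink", width = 0.5) +
  geom_text(aes(label = round(mean_popularity, 1)), vjust = -0.25) +
  labs(title = "Mean track popularity by playlist genre",
       x = "Playlist genre", y = "Mean track popularity")
p
```

Replacing toy with data1 would apply the same summary to the cleaned data set.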

Histogram view

The chart below shows the popularity of each genre, measured by the number of tracks in each playlist genre.

# Bar chart of track counts per playlist genre
ggplot(data1, aes(x = playlist_genre)) + geom_bar()

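Popularity by playlist subgenre, measured by the number of tracks:

# Bar chart of track counts per playlist subgenre (flipped so the labels stay readable)
ggplot(data1, aes(x = playlist_subgenre)) + geom_bar() + coord_flip()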

Box plots

# Box plots


bp <- ggplot(data1, aes(x = duration_ms, y = playlist_genre, fill = playlist_genre)) +
  geom_boxplot() +
  labs(title = "Plot of duration against playlist genre", x = "Duration (ms)", y = "Playlist genre")
bp + theme_classic()

Plot 2:

Playlist subgenre vs. duration:

# Box plots

bp <- ggplot(data1, aes(x = duration_ms, y = playlist_subgenre, fill = playlist_subgenre)) +
  geom_boxplot() +
  labs(title = "Plot of duration against playlist subgenre", x = "Duration (ms)", y = "Playlist subgenre")
bp + theme_classic()

Summary

In summary, the charts above indicate that rock, EDM, R&B, and rap were the most popular genres in today’s music industry. Using cleaned data made it easier to build the graphs and to obtain concise numerical summaries for the Global Music dataset. Summary computations and graphical visualizations such as bar charts, histograms, and box plots made it easier to address the problem statements.

Interesting insights of the analysis

Limitations

  1. Data Processing: R stores objects in physical memory and requires all data to be held there at once. It can also use more memory than comparable workflows in languages such as Python. As a result, it is not the best option for Big Data on its own, although data management packages and Hadoop connectivity can mitigate this.

  2. Safety and Security: R lacks built-in security features that most programming languages, such as Python, provide. This brings a number of restrictions, including difficulty embedding R in a web application.

  3. Difficult Language: R is a difficult language to master. The learning curve is quite steep, owing to